David J. Birnbaum (djbpitt@gmail.com, http://www.obdurodon.org), Last modified 2015
This example collates ten full witnesses of Partonopeus de Blois (the files are available at the Oxford Text Archive; the quasi-TEI XML files are in the 2499/data/xml subdirectory of the zip file).
In Part 1 of this tutorial we collated just a single line from just four witnesses, spelling out the details step by step in a way that would not be used in a real project, but that made it easy to see how each step moves toward the final result. In Part 2 we employed three classes (WitnessSet, Line, Word) to make the code more extensible and adaptable. In Part 3 we enhance the processing by reading the witness data from files, collating the full (multi-line) witness texts line by line, and accommodating gaps and irregularities in the line numbering.
The markup in the input files is similar in some respects to TEI, but the root element is <part>, obligatory TEI elements like <teiHeader> and <text> are not present, and the documents are in no namespace. Lines are tagged as <l>, and each line has @id and @n attributes. The value of the @n attribute refers to the order of the line within the individual witness, which is not relevant for collation. The @id attribute, on the other hand, represents the line number in a synopsis of all witnesses, which means that, for example, the <l id='34'> lines from all witnesses should be collated together, and similarly for other @id values. This makes it easy to identify the segments to be treated as separate collation sets; we can collate all versions of line #1 against one another, and then, separately, collate all versions of line #2 against one another, etc., ultimately concatenating the results. There are two peculiarities of the @id values that are relevant here:
- Not every witness contains every line, so when we iterate over the @id numbers, we need to accommodate gaps in the data.
- @id values are not only consecutive integers. Some values have appended letters, so that, for example, in witness G line 4008 is followed by 4008a and then 4009. This means that if we want to iterate over the @id values in order, we cannot rely on either purely numeric or purely string order.
Additionally, in Part 1 and Part 2 of this tutorial:
- The siglum was recorded on the single <l> element of each witness. Now that we are dealing with full witnesses that contain multiple lines, we have to locate the siglum elsewhere (below we derive it from the filename).
- Each witness consisted of a single <l> element, and we ignored the rest of the documents whence those single lines had been extracted manually. Now that we are dealing with complete TEI-based documents, we have to decide what to do with the rest of the content, that is, with the elements that are not lines.
In this tutorial we ignore the other elements of the input documents except for the siglum. In real-life collation tasks with complete TEI documents, developers would probably want to incorporate at least some metadata from the <teiHeader> components of the sources.
Load libraries. In addition to the libraries used in Part 2, we also load the os library, because we will be reading input from the file system, and the itertools library, to help concatenate lists efficiently.
In [11]:
from collatex import *
from lxml import etree
import json, re, os, itertools
We create our own sort key function, built around linenoRegex, a regular expression with two capture groups, both of which capture strings by default. The first captures the digits at the beginning of the line number (the @id value); the second captures anything that follows. The regex splits the input into a tuple of two strings, and we convert the first value to an integer before we return it. For example, the input value '4008a' will return (4008,'a'), where the 4008 is an integer and the 'a' is a string. We can then specify that our @id values should be sorted according to the results of processing them with this function. This overcomes the limitation of our being unable to sort them numerically (because some of them contain letters) or alphabetically (because '10' would sort before '9' alphabetically).
In [12]:
def splitId(id):
    """Splits @id value like 4008a into parts, for sorting"""
    linenoRegex = re.compile(r'(\d+)(.*)')  # raw string, so that \d is not treated as an escape
    results = linenoRegex.match(id).groups()
    return (int(results[0]), results[1])
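A quick sanity check (our own example values, not from the corpus) shows that the key function yields the desired order:

sorted(['4009', '10', '4008a', '9', '4008'], key=splitId)
# ['9', '10', '4008', '4008a', '4009']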
The WitnessSet class represents all of the witnesses being collated.
Unlike in Parts 1 and 2, where each witness contained just one line (<l>
element), the witnesses now contain multiple lines. We segment the witnesses by @id
value, so that each segment (set of readings to be collated) consists of lines that share an @id
value. To iterate over those values, we need to get a complete list of them, and to ensure that the output is in the correct order, we need to sort them. Lines will be processed individually, segmenting the collation task into subtasks that collate just one line at a time. The all_ids() method returns a sorted, deduplicated list of line identifiers (@id values) from all witnesses.
The generate_json_input() method returns a Python dictionary in the JSON shape that CollateX expects as input.
In [13]:
class WitnessSet:
    def __init__(self, witnessList):
        self.witnessList = witnessList
    def all_witnesses(self):
        """List of Witness objects, one for each witness"""
        return [Witness(witness) for witness in self.witnessList]
    def all_ids(self):
        """Sorted deduplicated list of all ids in corpus"""
        return sorted(set(itertools.chain.from_iterable(
            [witness.XML().xpath('//l/@id') for witness in self.all_witnesses()])), key=splitId)
    def get_lines_by_id(self, id):
        """List of tuples of siglum plus <l> element from each witness that contains the line"""
        witnesses_with_line = []
        for witness in self.all_witnesses():
            try:
                # quote the value in the XPath predicate: some @id values, like 4008a, are not numbers
                witnesses_with_line.append((witness.siglum, witness.XML().xpath('//l[@id = "' + id + '"]')[0]))
            except IndexError:  # witness does not contain this line
                pass
        return witnesses_with_line
    def generate_json_input(self, lineId):
        """JSON input to CollateX for an <l> segment"""
        json_input = {}
        witnesses = []
        for siglum, line in self.get_lines_by_id(lineId):
            witnesses.append({'id': siglum, 'tokens': Line(line).tokens()})
        json_input['witnesses'] = witnesses
        return json_input
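For orientation, the dictionary that generate_json_input() builds follows the JSON input format that CollateX expects: a 'witnesses' list in which each witness has an 'id' (here the siglum) and a list of 'tokens'. With invented readings, the shape looks like this:

{
    'witnesses': [
        {'id': 'A', 'tokens': [{'t': 'una', 'n': 'una'}, {'t': 'sanz', 'n': 'sanz'}]},
        {'id': 'B', 'tokens': [{'t': 'Une', 'n': 'une'}, {'t': 'sans', 'n': 'sans'}]}
    ]
}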
In [14]:
class Witness:
    """Each witness in the witness set is an instance of class Witness"""
    def __init__(self, witness):
        self.witness = witness
        self.siglum = self.witness[0]
        self.contents = self.witness[1]
    def XML(self):
        """Parse the witness contents (bytes) into an XML tree"""
        return etree.XML(self.contents)
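As a quick illustration (made-up data, not a file from the corpus), a Witness is built from a (siglum, contents) tuple:

w = Witness(('Z', b'<part><l id="1">una sanz plus</l></part>'))
w.siglum                  # 'Z'
w.XML().xpath('//l/@id')  # ['1']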
The Line class contains methods applied to individual lines. The XSLT stylesheets and the functions to use them have been moved into the Line class, since they apply to individual lines. The siglum for the line is retrieved from the witness that contains it, and is part of the Witness class. The tokens() method returns a list of JSON token objects, one for each word in the line.
In [15]:
class Line:
    """An instance of Line is a line in a witness, expressed as an <l> element"""
    addWMilestones = etree.XML("""
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:output method="xml" indent="no" encoding="UTF-8" omit-xml-declaration="yes"/>
        <xsl:template match="*|@*">
            <xsl:copy>
                <xsl:apply-templates select="node() | @*"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="@*"/>
                <!-- insert a <w/> milestone before the first word -->
                <w/>
                <xsl:apply-templates/>
            </xsl:copy>
        </xsl:template>
        <!-- convert <add>, <sic>, and <crease> to milestones (and leave them that way)
             CUSTOMIZE HERE: add other elements that may span multiple word tokens
        -->
        <xsl:template match="add | sic | crease">
            <xsl:element name="{name()}">
                <xsl:attribute name="n">start</xsl:attribute>
            </xsl:element>
            <xsl:apply-templates/>
            <xsl:element name="{name()}">
                <xsl:attribute name="n">end</xsl:attribute>
            </xsl:element>
        </xsl:template>
        <xsl:template match="note"/>
        <xsl:template match="text()">
            <!-- replace newlines with spaces before splitting on spaces -->
            <xsl:call-template name="whiteSpace">
                <xsl:with-param name="input" select="translate(., '&#x0A;', ' ')"/>
            </xsl:call-template>
        </xsl:template>
        <xsl:template name="whiteSpace">
            <xsl:param name="input"/>
            <xsl:choose>
                <xsl:when test="not(contains($input, ' '))">
                    <xsl:value-of select="$input"/>
                </xsl:when>
                <xsl:when test="starts-with($input,' ')">
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring($input,2)"/>
                    </xsl:call-template>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($input, ' ')"/>
                    <w/>
                    <xsl:call-template name="whiteSpace">
                        <xsl:with-param name="input" select="substring-after($input,' ')"/>
                    </xsl:call-template>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:template>
    </xsl:stylesheet>
    """)
    transformAddW = etree.XSLT(addWMilestones)

    xsltWrapW = etree.XML('''
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/>
        <xsl:template match="/*">
            <xsl:copy>
                <xsl:apply-templates select="w"/>
            </xsl:copy>
        </xsl:template>
        <xsl:template match="w">
            <!-- faking <xsl:for-each-group>, as well as the "<<" and "except" operators -->
            <xsl:variable name="tooFar" select="following-sibling::w[1] | following-sibling::w[1]/following::node()"/>
            <w>
                <xsl:copy-of select="following-sibling::node()[count(. | $tooFar) != count($tooFar)]"/>
            </w>
        </xsl:template>
    </xsl:stylesheet>
    ''')
    transformWrapW = etree.XSLT(xsltWrapW)

    def __init__(self, line):
        self.line = line
    def tokens(self):
        return [Word(token).createToken() for token in Line.transformWrapW(Line.transformAddW(self.line)).xpath('//w')]
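To watch the two stylesheets work (on a made-up line, not one from the corpus), we can run them in sequence; the printed output should look roughly like the comments below. Note that the second stylesheet copies only the <w> elements, so the @id attribute does not survive; that does not matter here, because tokens() extracts only the <w> elements anyway.

demo = etree.XML('<l id="1">Mout par fu rices</l>')
print(Line.transformAddW(demo))
# <l id="1"><w/>Mout<w/>par<w/>fu<w/>rices</l>
print(Line.transformWrapW(Line.transformAddW(demo)))
# <l><w>Mout</w><w>par</w><w>fu</w><w>rices</w></l>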
The Word class contains methods that apply to individual words. unwrap()
and normalize()
are private; they are used by createToken()
to return a JSON object with the "t" and "n" properties for a word token.
In [16]:
class Word:
    unwrapRegex = re.compile(r'<w>(.*)</w>')
    stripTagsRegex = re.compile(r'<.*?>')
    def __init__(self, word):
        self.word = word
    def unwrap(self):
        """Serialized content of the <w> element, without the <w> tags themselves"""
        return Word.unwrapRegex.match(etree.tostring(self.word, encoding='unicode')).group(1)
    def normalize(self):
        """Lowercase the reading and strip any internal markup"""
        return Word.stripTagsRegex.sub('', self.unwrap().lower())
    def createToken(self):
        """JSON token object with 't' (text) and 'n' (normalized) properties"""
        token = {}
        token['t'] = self.unwrap()
        token['n'] = self.normalize()
        return token
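A small illustrative check of the tokenization: markup inside a word survives in 't' but is stripped from the lowercased 'n', and a whole line (now that both classes are defined) reduces to a list of such tokens:

Word(etree.XML('<w><sic n="start"/>Rices<sic n="end"/></w>')).createToken()
# {'t': '<sic n="start"/>Rices<sic n="end"/>', 'n': 'rices'}
Line(etree.XML('<l id="1">Mout par fu</l>')).tokens()
# [{'t': 'Mout', 'n': 'mout'}, {'t': 'par', 'n': 'par'}, {'t': 'fu', 'n': 'fu'}]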
Create XML data and assign it to a witnessSet variable
Our witnesses are XML files in the 'partonopeus' subdirectory of our current location. Verify that the files are there by listing them.
In [17]:
os.listdir('partonopeus')
Out[17]:
Create a two-member tuple for each file, consisting of the one-letter identifier (the filename with the '.xml' extension removed) and the contents of the file. Assemble these into a list of tuples and use it to create an instance of the WitnessSet class, assigned to the variable witnessSet. Because we use the lxml library to parse the XML and the files contain Unicode data, they must be opened in raw (bytes) mode, so that lxml can respect the encoding declared in the documents.
In [18]:
witnesses = []
for inputFile in os.listdir('partonopeus'):
    with open('partonopeus/' + inputFile, 'rb') as f:  # bytes mode, so that lxml handles the encoding
        witnesses.append((inputFile[0], f.read()))
witnessSet = WitnessSet(witnesses)
Generate sample JSON input from a single line of data (here line 10) and examine it
In [19]:
json_input = witnessSet.generate_json_input('10')
print(json_input)
Collate and output the results of the sample as a plain-text alignment table, as JSON, and as colored HTML
In [20]:
collationText = collate(json_input, output='table')
print(collationText)
collationJSON = collate(json_input, output='json')
print(collationJSON)
collationHTML2 = collate(json_input, output='html2')
In [ ]:
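The empty cell above is where the full collation belongs. A minimal sketch of that step (our suggestion, not code from the original notebook): iterate over the ordered @id values and collate each segment in turn, printing one plain-text alignment table per line:

for lineId in witnessSet.all_ids():
    print('Line ' + lineId)
    print(collate(witnessSet.generate_json_input(lineId), output='table'))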